AITopics | convergence proof

Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima

Neural Information Processing Systems, Feb-19-2026, 11:25:20 GMT

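To address this gap, we study deterministic/stochastic versions of SAM with practical configurations (i.e., constant $\rho$ and gradient normalization in $y_t$) and explore their convergence properties on smooth functions with (non)convexity assumptions.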

artificial intelligence, machine learning, theorem 4, (15 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems, Oct-3-2025, 01:08:46 GMT

CRFs are an example of such a setting (it is relatively slow to compute the function value and/or gradient of an instance). The main contribution is a superlinear convergence proof of the algorithm under the assumption of strong convexity in a subspace of the original optimization.

algorithm, experiment, iteration, (12 more...)

Neural Information Processing Systems

Country: North America > Canada > Quebec > Montreal (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems, Oct-2-2025, 19:52:35 GMT

This paper derives policy gradient algorithms for risk-sensitive MDPs under the CVaR criterion, a recent and popular risk measure. First, the authors derive gradients of the objective based on a Lagrangian relaxation of the constrained optimization problem. This naturally yields a policy gradient algorithm in which the expected return that appears in the gradient is estimated from full trajectories (REINFORCE-like). They then propose a scheme to obtain incremental actor-critic versions, where the critic computes the value (and other quantities) of an augmented MDP that is convenient for gradient estimation.

algorithm, contribution, experiment, (11 more...)

Neural Information Processing Systems

Country: North America > Canada > Quebec > Montreal (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.97)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.35)


Simple Convergence Proof of Adam From a Sign-like Descent Perspective

Peng, Hanyang, Qin, Shuang, Yu, Yue, Jiang, Fangqing, Wang, Hui, Lin, Zhouchen

arXiv.org Artificial Intelligence, Jul-10-2025

Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as a preconditioned stochastic gradient descent with momentum (SGDM), formulated as $\bm{x}_{t+1} = \bm{x}_t - \frac{\gamma_t}{\sqrt{\bm{v}_t}+\epsilon} \circ \bm{m}_t$. This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend. In contrast, we propose a novel interpretation by treating Adam as a sign-like optimizer, expressed as $\bm{x}_{t+1} = \bm{x}_t - \gamma_t \frac{|\bm{m}_t|}{\sqrt{\bm{v}_t}+\epsilon} \circ {\rm Sign}(\bm{m}_t)$. This reformulation significantly simplifies the convergence analysis. For the first time, with some mild conditions, we prove that Adam achieves the optimal rate of $\mathcal{O}\left(\frac{1}{T^{1/4}}\right)$ rather than the previous $\mathcal{O}\left(\frac{\ln T}{T^{1/4}}\right)$ under weak assumptions of the generalized $p$-affine variance and $(L_0, L_1, q)$-smoothness, without dependence on the model dimensionality or the numerical stability parameter $\epsilon$. Additionally, our theoretical analysis provides new insights into the role of momentum as a key factor ensuring convergence and offers practical guidelines for tuning learning rates in Adam, further bridging the gap between theory and practice.
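
The two update formulas in this abstract describe the same step, because $\bm{m}_t = |\bm{m}_t| \circ {\rm Sign}(\bm{m}_t)$ elementwise. The following minimal NumPy sketch checks that numerically. The function names, the moment-update rule with `beta1`/`beta2`, and the absence of bias correction are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def adam_moments(m, v, grad, beta1=0.9, beta2=0.999):
    """Exponential moving averages of the gradient and its square (assumed form)."""
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2
    return m_new, v_new

def step_preconditioned(x, m, v, lr, eps=1e-8):
    """Preconditioned-SGDM view: x - lr / (sqrt(v) + eps) * m."""
    return x - lr / (np.sqrt(v) + eps) * m

def step_sign_like(x, m, v, lr, eps=1e-8):
    """Sign-like view: x - lr * (|m| / (sqrt(v) + eps)) * sign(m)."""
    scale = np.abs(m) / (np.sqrt(v) + eps)
    return x - lr * scale * np.sign(m)

x = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, -0.1, 0.0])
m, v = adam_moments(np.zeros(3), np.zeros(3), grad)

x_a = step_preconditioned(x, m, v, lr=1e-3)
x_b = step_sign_like(x, m, v, lr=1e-3)
print(np.allclose(x_a, x_b))
```

In the sign-like view the direction comes only from `np.sign(m)`, while `scale` acts as a per-coordinate step-size multiplier.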

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2507.05966

Country: Asia > China (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)


FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information

Hwang, Dongseong

arXiv.org Artificial Intelligence, Jul-9-2024

This paper establishes a mathematical foundation for the Adam optimizer, elucidating its connection to natural gradient descent through Riemannian and information geometry. We rigorously analyze the diagonal empirical Fisher information matrix (FIM) in Adam, clarifying all detailed approximations and advocating for the use of log probability functions as loss, which should be based on discrete distributions, due to the limitations of empirical FIM. Our analysis uncovers flaws in the original Adam algorithm, leading to proposed corrections such as enhanced momentum calculations, adjusted bias corrections, adaptive epsilon, and gradient clipping. We refine the weight decay term based on our theoretical framework. Our modified algorithm, Fisher Adam (FAdam), demonstrates superior performance across diverse domains including LLM, ASR, and VQ-VAE, achieving state-of-the-art results in ASR.

fim, gradient, natural gradient, (15 more...)

arXiv.org Artificial Intelligence

2405.12807

Country:

North America > United States > Indiana > Hamilton County > Fishers (0.04)
North America > United States > California > Santa Clara County > Mountain View (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)


An inexact Bregman proximal point method and its acceleration version for unbalanced optimal transport

Chen, Xiang, Wang, Faqiang, Liu, Jun, Cui, Li

arXiv.org Artificial Intelligence, Feb-26-2024

The Unbalanced Optimal Transport (UOT) problem plays an increasingly important role in computational biology, computational imaging and deep learning. The Scaling algorithm is widely used to solve UOT because of its convenience and good convergence properties. However, this algorithm has lower accuracy for large regularization parameters, and small regularization parameters can easily lead to numerical overflow because of stability issues. We address this challenge by developing an inexact Bregman proximal point method for solving UOT. This algorithm approximates the proximal operator using the Scaling algorithm at each iteration. The algorithm (1) converges to the true solution of UOT, (2) has theoretical guarantees and robust regularization parameter selection, (3) mitigates numerical stability issues, and (4) can achieve computational complexity comparable to the Scaling algorithm in specific practical settings. Building on this, we develop an accelerated version of the inexact Bregman proximal point method for solving UOT by using acceleration techniques for the Bregman proximal point method, and we provide theoretical guarantees and experimental validation of convergence and acceleration.
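
The idea of approximating each proximal step with a few Scaling iterations can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the KL-penalised form of UOT with marginal weight `tau`, a KL Bregman term with weight `eps`, and a fixed number of inner iterations. The names `scaling_uot` and `inexact_bregman_uot` and all default values are hypothetical.

```python
import numpy as np

def scaling_uot(K, a, b, eps, tau, n_iter):
    """A few scaling iterations for KL-penalised UOT with kernel K (inexact inner solve)."""
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = tau / (tau + eps)  # exponent from the KL marginal penalty
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]

def inexact_bregman_uot(C, a, b, eps=1.0, tau=1.0, outer=50, inner=5):
    """Outer proximal point loop: the previous plan is folded into the kernel."""
    plan = np.outer(a, b)   # strictly positive starting plan (assumed choice)
    G = np.exp(-C / eps)    # Gibbs factor, computed once with a moderate eps
    for _ in range(outer):
        plan = scaling_uot(plan * G, a, b, eps, tau, inner)
    return plan

C = np.array([[0.0, 1.0], [1.0, 0.0]])
a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
plan = inexact_bregman_uot(C, a, b)
print(plan)
```

Because the previous plan is multiplied into the kernel at every outer step, `eps` can stay moderate, so `np.exp(-C / eps)` never has to be evaluated for the very small regularization values that cause numerical trouble in a single-shot Scaling solve.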

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2402.16978

Country:

Asia > China > Beijing > Beijing (0.04)
Europe > France (0.04)

Genre: Research Report (0.50)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.34)


Practical Sharpness-Aware Minimization Cannot Converge All the Way to Optima

Si, Dongkuk, Yun, Chulhee

arXiv.org Machine Learning, Oct-27-2023

Sharpness-Aware Minimization (SAM) is an optimizer that takes a descent step based on the gradient at a perturbation $y_t = x_t + \rho \frac{\nabla f(x_t)}{\lVert \nabla f(x_t) \rVert}$ of the current point $x_t$. Existing studies prove convergence of SAM for smooth functions, but they do so by assuming decaying perturbation size $\rho$ and/or no gradient normalization in $y_t$, which is detached from practice. To address this gap, we study deterministic/stochastic versions of SAM with practical configurations (i.e., constant $\rho$ and gradient normalization in $y_t$) and explore their convergence properties on smooth functions with (non)convexity assumptions. Perhaps surprisingly, in many scenarios, we find that SAM has limited capability to converge to global minima or stationary points. For smooth strongly convex functions, we show that while deterministic SAM enjoys tight global convergence rates of $\tilde \Theta(\frac{1}{T^2})$, the convergence bound of stochastic SAM suffers an inevitable additive term $O(\rho^2)$, indicating convergence only up to neighborhoods of optima. In fact, such $O(\rho^2)$ factors arise for stochastic SAM in all the settings we consider, and also for deterministic SAM in nonconvex cases; importantly, we prove by examples that such terms are unavoidable. Our results highlight vastly different characteristics of SAM with vs. without decaying perturbation size or gradient normalization, and suggest that the intuitions gained from one version may not apply to the other.
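
The practical configuration studied here (constant $\rho$, normalized gradient in $y_t$) can be written as a short deterministic update. The sketch below is illustrative only: the function name `sam_step`, the step size `lr`, the zero-gradient guard, and the quadratic test function are assumptions, not details from the paper.

```python
import numpy as np

def sam_step(x, grad_fn, lr, rho):
    """One deterministic SAM step with constant rho and gradient normalization."""
    g = grad_fn(x)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        # Assumed convention: stay put when the perturbation direction is undefined.
        return x
    y = x + rho * g / norm       # perturbed point y_t
    return x - lr * grad_fn(y)   # descent step using the gradient at y_t

def grad_quadratic(x):
    """Gradient of f(x) = 0.5 * ||x||^2, a smooth strongly convex example."""
    return x

x0 = np.array([3.0, 4.0])
x1 = sam_step(x0, grad_quadratic, lr=0.1, rho=0.5)
print(x1)
```

Note that the perturbation always has length `rho`, no matter how small the gradient is, which is the feature that distinguishes this version from the unnormalized or decaying-$\rho$ variants mentioned in the abstract.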

artificial intelligence, machine learning, theorem 4, (15 more...)

arXiv.org Machine Learning

2306.0985

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)


A Convergence Proof for the Softassign Quadratic Assignment Algorithm

Neural Information Processing Systems, Apr-6-2023, 18:13:49 GMT

The softassign quadratic assignment algorithm has recently emerged as an effective strategy for a variety of optimization problems in pattern recognition and combinatorial optimization. While the effectiveness of the algorithm was demonstrated in thousands of simulations, there was no known proof of convergence. Here, we provide a proof of convergence for the most general form of the algorithm.

convergence proof, softassign quadratic assignment algorithm

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.40)


Incremental Natural Actor-Critic Algorithms

Neural Information Processing Systems, Apr-6-2023, 14:42:01 GMT

We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases.
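
The division of labour described above (a critic trained by temporal difference learning, an actor updated by a stochastic gradient step) can be illustrated with a single update. This sketch is a plain, non-natural actor-critic step, not one of the paper's four algorithms. The linear critic, the softmax-linear policy, the discounted TD error, and every name and default value below are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over action preferences."""
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_step(theta, w, phi_s, phi_next, action, reward,
                      gamma=0.9, alpha_actor=0.1, alpha_critic=0.1):
    """One TD(0) actor-critic update.

    theta: (n_actions, n_features) policy parameters
    w:     (n_features,) critic parameters for V(s) = w . phi(s)
    """
    # TD error computed from the critic's current value estimates
    delta = reward + gamma * (w @ phi_next) - (w @ phi_s)
    # Critic: temporal difference update of the value-function parameters
    w_new = w + alpha_critic * delta * phi_s
    # Actor: gradient of log pi(action | s) for a softmax-linear policy
    probs = softmax(theta @ phi_s)
    grad_log = -np.outer(probs, phi_s)
    grad_log[action] += phi_s
    theta_new = theta + alpha_actor * delta * grad_log
    return theta_new, w_new, delta

theta0 = np.zeros((2, 2))
w0 = np.zeros(2)
theta1, w1, delta = actor_critic_step(
    theta0, w0,
    phi_s=np.array([1.0, 0.0]), phi_next=np.array([0.0, 1.0]),
    action=0, reward=1.0,
)
print(delta, w1, theta1)
```

The actor uses the TD error `delta` in place of a full-trajectory return, which is the variance reduction the abstract attributes to temporal difference learning. A natural-gradient variant would additionally precondition `grad_log`, which this sketch does not attempt.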

actor-critic method, convergence proof, incremental natural actor-critic algorithm, (3 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)
